“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” — Mitchell (1997)
In other words, a machine learning algorithm learns from data
We would hope that the algorithm is able to perform well at a defined task given some data
Learning Algorithms
The task:
What is it that we’re trying to learn?
Classification
Regression
Some twist on these two
I’m going to say that these tasks can be broken down into (roughly) three not-mutually-exclusive types:
Prediction
Description
Inference/Explanation
Learning Algorithms
Inference/Explanation:
Learn how parts of a system affect one another.
Most common: Causal Inference
Learn how one (or more) variables in a system (e.g. a collection of features and outcomes) affect one another
In the regression context, loss for inference problems is defined as:
\mathcal L(\hat{\boldsymbol \beta}, \boldsymbol \beta)
maybe mean squared error loss:
E[(\hat{\boldsymbol \beta} - \boldsymbol \beta)^2]
Learning Algorithms
Note that explanation centers wholly on the values of coefficients!
Truthfully uncover the size of an effect
There is a subset of machine learning that centers on this concept
Causal machine learning
The “deconfounder” (Wang and Blei, 2019)
Causal Forests (Athey et al, 2019)
Matrix Completion for Panel Data (Athey, 2021)
Synthetic controls (Xu, 2018)
Causal Discovery (Friedman et al, 2000)
Interesting. But, not the focus of this class.
Learning Algorithms
Description
Given a complex set of features, mathematically represent those features in an understandable way
Most common: Dimensionality reduction and density estimation
Most frequently unsupervised methods (the experience)
We only see (or use) a set of features
There is no meaningful outcome to be trained on
No supervision and no real loss function(-ish)
Learning Algorithms
Dimensionality Reduction
We have \mathbf X, an N \times P feature matrix, where P is quite large. We believe that we can meaningfully represent \mathbf X as \boldsymbol \Omega, an N \times M matrix where M \ll P.
Think PCA
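A minimal sketch of this reduction using scikit-learn's PCA (the synthetic data, dimensions, and variable names here are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical example: N = 200 samples of P = 50 features that
# actually live near a low-dimensional subspace.
N, P, M = 200, 50, 3
latent = rng.normal(size=(N, M))              # hidden low-dimensional structure
loadings = rng.normal(size=(M, P))
X = latent @ loadings + 0.1 * rng.normal(size=(N, P))

# Represent X as an N x M matrix Omega, with M << P.
pca = PCA(n_components=M)
Omega = pca.fit_transform(X)
print(Omega.shape)                            # (200, 3)
print(pca.explained_variance_ratio_.sum())    # most of the variance is retained
```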
Density Estimation
We have \mathbf X, an N \times P feature matrix. We would like to use these examples to make a guess as to the true P-dimensional probability distribution of the features - learn f(\mathbf x).
Think fitting a multivariate normal distribution to a collection of N samples like in QDA.
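A sketch of that Gaussian version (the true mean and covariance below are invented for illustration): estimate the mean vector and covariance matrix from samples, just as QDA does for each class.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Hypothetical example: N samples from an unknown P-dimensional distribution.
N = 1000
X = rng.multivariate_normal(mean=[1.0, -1.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=N)

# Density estimation with a Gaussian: estimate the mean and covariance.
mu_hat = X.mean(axis=0)
sigma_hat = np.cov(X, rowvar=False)

# The learned density f(x) can now be evaluated anywhere.
f = multivariate_normal(mean=mu_hat, cov=sigma_hat)
print(f.pdf([1.0, -1.0]))   # density near the estimated mean
```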
Learning Algorithms
Recall from last class that we have many tasks where we would like to learn f(\mathbf x) - the marginal distribution of the features
Not trying to use them to create predictions of an outcome, directly
These are largely descriptive tasks!
It’s not just PCA and matrix factorization methods.
There’s a lot of neat stuff here that we’ll leverage for generation methods
Learning Algorithms
Example: Word Embeddings
Let’s suppose that we want to perform some sort of analysis on a collection of texts
Problem: The set of “word” features can be quite high dimensional
Method: GloVe
Converts a collection of tokens from documents to a lower dimensional space
Each token is represented in this lower dimensional space and is organized to place similar words close to one another and words that aren’t contextually similar far apart
For details, see PML 20.5.2
Learning Algorithms
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Convert the GloVe file to word2vec format
glove_input_file = 'glove.6B.100d.txt'  # replace with your GloVe file path
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

# Load the model
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

# Define a list of words
words = ['king', 'queen', 'man', 'woman']

# Get the embeddings for the words
embeddings = [model[word] for word in words]

# Use PCA to reduce the dimensionality to 2 dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Plot the 2D embeddings
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])

# Add labels to the points
for i, word in enumerate(words):
    plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]))
plt.show()
Learning Algorithms
Description is a problem of learning!
The more examples we see, the better we are able to describe what we see.
It just requires its own language compared to predictive models
We’ll introduce this language as it arises!
Learning Algorithms
The most common type of task is predictive
Given some training data, \{\mathbf x_i , y_i\}_{i =1}^N, learn a function that maps the training features to the training outcomes in such a way that it will do a good job for an instance not included in the training data, \{\mathbf x_0 , y_0\}.
The class of tasks you are probably most comfortable with!
The goal of a predictive model is to learn from a set of training instances in a way that tells something about the true data generating process
Separates signal from noise.
Learning Algorithms
When we think predictive modeling, we are most often thinking about supervised learning
The features are associated with a true outcome that could be measured
We can compare the value of our prediction to the truth to judge how well our choice of model has worked
Most often, we’re going to assume that the experience that the algorithm receives is a single training set with all instances labeled
Each observation is associated with a true outcome
Learning Algorithms
This need not always be the case:
Semi-supervised learning: Some instances are labeled while others are not (large scale image classification)
Multi-instance learning: Learn if the training set contains at least one instance of the desired outcome without creating a prediction for each instance (Image of a beach - some sand instances, some water instances)
Reinforcement learning: A system constantly labels its experiences for the sake of learning to perform a task (AI teammates in DOTA)
Learning Algorithms
We say that a learning algorithm is a good one if it performs well
It does a good job of predicting the outcome for an instance that was not included in the training data
This is not well-defined without a mathematical definition of performance
With respect to what?
Predictive Performance
The generic predictive problem:
y = f(\mathbf x) + \epsilon
where f(\mathbf x) is the signal that can be explained with the feature set and \epsilon is a zero-centered random noise term.
Goal: Use the training data to learn about the signal function and ignore the noise
Why ignore the noise?
Predictive Performance
The problem is that the outcome we see isn’t conveniently separated into signal and noise
What we get is a mixture of the two
This is what we need to learn!
Some summary of the distribution over the possible outcomes!
Predictive Performance
We need a metric to define success.
The best way to do this is to leverage a proper loss function
A function that equals zero when its two arguments are equal
The further the two arguments are from one another, the higher the loss
We use a learning method to create a guess for the data generating process - \hat{f}(\mathbf x)
And compare our guess to the truth that we observe - y = f(\mathbf x) + \epsilon
Predictive Performance
Most common loss functions:
For continuous outcomes, we most frequently use mean squared error loss. Given N examples that we would like to assess:
\mathcal L(\mathbf y , \hat{\mathbf y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2
which is just a reworking of the proper loss function idea from the previous slide.
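As a quick sanity check, MSE is just the empirical average of squared residuals; it is zero when the arguments agree and grows as they move apart (the numbers below are toy inputs):

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared error: average squared distance between truth and prediction."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

print(mse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # 0.0
print(mse_loss([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))   # (1 + 0 + 4) / 3
```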
Predictive Performance
Given a ground truth set of outcomes, it makes sense to:
Choose a loss function for our task
Choose the function that minimizes this loss
What we are given in a supervised learning problem:
A set of training features, \mathbf X
And the corresponding set of training labels, \mathbf y
The goal:
Uncover a function that does a good job of creating predictions for data that was not seen in the training set
Predictive Performance
Let’s start with a comfortable method: Decision Trees
As a review, given a feature set, \mathbf X, and a vector of classes, \mathbf y:
Start with the null model
Find the split on one feature that minimizes the classification loss on either side of the split; creates two partitions
Within each new partition, find the split that minimizes the loss; split again
Split until a fixed number of nodes is reached or until the tree puts every observation into its own node
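The steps above can be sketched with scikit-learn's DecisionTreeClassifier, whose max_leaf_nodes parameter caps the number of terminal nodes and grows the tree greedily, one best split at a time (the data here is a synthetic stand-in, not the class example):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

# Hypothetical two-feature classification problem with a linear boundary.
X = rng.uniform(-3, 3, size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# max_leaf_nodes caps the number of terminal nodes; the tree is grown
# greedily, choosing the single-feature split that most reduces loss.
tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
tree.fit(X, y)
print(tree.get_n_leaves())   # at most 10 terminal nodes
```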
Predictive Performance
Cross-entropy loss for decision trees requires using an empirical probability measure:
Within a node, say that the probability that a prediction within that partition belongs to each class corresponds to the empirical proportion of observations within the node that are in each class
Pretty simple!
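As a toy sketch (the node contents below are invented for illustration), the empirical probability measure and the resulting cross-entropy of a node look like:

```python
import numpy as np

# Class labels of the training observations that fell into one node.
node_labels = np.array([0, 0, 0, 1, 1])

# Empirical class probabilities: the share of node observations in each class.
classes, counts = np.unique(node_labels, return_counts=True)
p_hat = counts / counts.sum()
print(dict(zip(classes.tolist(), p_hat.tolist())))   # {0: 0.6, 1: 0.4}

# Cross-entropy of the node under these empirical probabilities.
cross_entropy = -np.sum(p_hat * np.log(p_hat))
print(cross_entropy)
```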
Predictive Performance
Let’s start by creating a true data generating process.
Predictive Performance
Let’s generate some training data from the true data generating process - 5000 points
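The slides' actual DGP isn't reproduced here; as a stand-in, here is one way such a two-feature binary DGP on [-3,3]^2 might be defined and sampled (the signal function, names, and parameters are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def true_class_probability(x1, x2):
    """A hypothetical signal: P(y = 1 | x) varies smoothly over the features."""
    return 1.0 / (1.0 + np.exp(-(np.sin(x1) + 0.5 * x2)))

def draw_sample(n, rng):
    """Draw n i.i.d. (x, y) pairs: features uniform on [-3, 3]^2, noisy labels."""
    X = rng.uniform(-3, 3, size=(n, 2))
    p = true_class_probability(X[:, 0], X[:, 1])
    y = (rng.uniform(size=n) < p).astype(int)   # the noise: labels flip probabilistically
    return X, y

X_train, y_train = draw_sample(5000, rng)
print(X_train.shape, y_train.mean())   # (5000, 2) and roughly the base rate
```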
Predictive Performance
We can split any algorithm into the sets of knowns and unknowns.
For binary decision tree classifiers:
Knowns: \mathbf X and \mathbf y \in \{0,1\}
Unknowns: Number of terminal nodes in the tree
We need to pick the number of terminal nodes!
Let’s do this to create the most generalizable model.
Predictive Performance
According to the training loss, we should fit a tree with 500 terminal nodes!
Any thoughts as to why this is the case?
We know, though, that this isn’t what we’re after!
The predictive goal:
For data that was not used to train the model
Maximize the generalizability of the model
From the true DGP, we know that it isn’t possible to get these predictions 100% correct everywhere.
Predictive Performance
A reasonable metric of generalizability is the average loss across the entire space of possible feature combinations.
\mathbf x \in [-3,3] \times [-3,3]
Predictive Performance
Specifically, let’s assume that each pair \{\mathbf x , y\} is an i.i.d. draw from a common joint distribution, \mathcal T
If I was able to take another draw from \mathcal T, without knowing what the actual feature/outcome pair is, what is the probability that I make the incorrect choice?
Note that the expectation of an indicator function is just the probability of the event (think about it…)
Predictive Performance
Here, we’ll use a large set of data drawn from the same DGP to evaluate this expectation
Evaluate the class predicted by the trained decision tree classifier at 100,000 new points drawn from the same DGP
The law of large numbers says that this average will get close to the expectation
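A sketch of this Monte Carlo evaluation (the leaf counts mirror the slides, but the DGP below is an invented stand-in): train trees of varying size on 5,000 points, then estimate the 0/1 generalization error on a large fresh draw.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)

def draw_sample(n, rng):
    """Hypothetical stand-in DGP: uniform features on [-3, 3]^2, noisy labels."""
    X = rng.uniform(-3, 3, size=(n, 2))
    p = 1.0 / (1.0 + np.exp(-(np.sin(X[:, 0]) + 0.5 * X[:, 1])))
    y = (rng.uniform(size=n) < p).astype(int)
    return X, y

X_train, y_train = draw_sample(5000, rng)
X_test, y_test = draw_sample(100_000, rng)   # large fresh draw; the LLN kicks in

results = {}
for leaves in [2, 10, 75, 500]:
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
    tree.fit(X_train, y_train)
    # 0/1 loss: the empirical probability of a wrong prediction on new data.
    results[leaves] = np.mean(tree.predict(X_test) != y_test)

print(results)
```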
Predictive Performance
In evaluating this expectation, the 500 node model is no longer the best!
The 75 node model is the best
And all of the lower node models are better than the 500 node model!
Predictive Performance
What’s going on here?
Why is 500 not the winner in terms of generalization?
Why is 2 not the winner?
Predictive Performance
We know that we’re overfitting with 500 nodes and underfitting with 2 nodes
What does overfitting mean?
What does underfitting mean?
Predictive Performance
This is all review.
Over the rest of this class and the next, I want to get real formal about our definitions of overfitting and underfitting.
How these terms relate to specific guarantees about training error and generalization error
How these terms relate to the size of the training data and the flexibility of the chosen method
Predictive Performance
A motivating example:
The previous exercise was done with N = 5000. What if we replicate the analysis for different training set sizes?
500 training samples
5000 training samples
50000 training samples
What do you think is going to happen to the optimal number of terminal nodes in terms of generalization error?
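A sketch of this replication (again with an invented stand-in DGP): for each training size, fit trees over a grid of terminal-node counts and record which count generalizes best on a large fresh draw.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)

def draw_sample(n, rng):
    """Hypothetical stand-in DGP: uniform features on [-3, 3]^2, noisy labels."""
    X = rng.uniform(-3, 3, size=(n, 2))
    p = 1.0 / (1.0 + np.exp(-(np.sin(X[:, 0]) + 0.5 * X[:, 1])))
    y = (rng.uniform(size=n) < p).astype(int)
    return X, y

X_test, y_test = draw_sample(100_000, rng)   # one shared evaluation draw

best_leaves = {}
for n in [500, 5000, 50_000]:
    X_train, y_train = draw_sample(n, rng)
    errors = {}
    for leaves in [2, 5, 10, 25, 75, 200, 500]:
        tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
        tree.fit(X_train, y_train)
        errors[leaves] = np.mean(tree.predict(X_test) != y_test)
    best_leaves[n] = min(errors, key=errors.get)

print(best_leaves)   # best complexity tends to grow with the training size
```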
Predictive Performance
With 500 samples, the best model in terms of generalization is very simple with just a few nodes
It’s better not to try to pick up real complexity; just get some generic boxes and go with that
With 5000 samples, 75 nodes and a little shape
Still better to not try to account for too many curves
With 50,000 samples, a lot of nodes and a lot of complexity is optimal!
There appears to be a relationship between the sample size and how flexible the optimal model can be.
Generalization Error
To further explore this relationship, we’ll need to get technical about some definitions and assumptions.
Big Assumption #1
Let’s assume that we draw a training set of size N. Each \{\mathbf x_i , y_i\} is an independent and identically distributed draw from \mathcal T
Dictates all of the possible combinations that we could see and the probability with which we expect to see each one
A giant multivariate distribution, f(\mathbf x ,y)
Generalization Error
This i.i.d. assumption makes a lot of sense
If we want to use the training sample to learn about the way in which the outcome relates to the features, we need to assume that the training sample is representative of the true distribution
Can’t say anything about the general case if we don’t believe that our training sample represents the general case…
Generalization Error
In the supervised learning problem, we want to learn f(y | \mathbf x) - the distribution of the outcome conditional on the features we observe
By assumption:
y = f(\mathbf x) + \epsilon
Some fixed function of the features plus a noise term
Under this construction, it can sometimes be useful to think about each y as a random variable since \mathbf x is assumed given
The y we observe conditional on the function and \mathbf x
The noise is variation in the outcome that is not explained by \mathbf x
The expectation of a conditional distribution is only a function of the givens.
Generalization Error
We use a learning method (KNN, linear regression, logistic regression, etc.) to create an approximation to f(\mathbf x) given the training data, \{\mathbf X_D , \mathbf y_D \}.
Let’s call this function \hat{f}(\mathbf x).
The prediction is generated from this function
For a given input, we’ll denote the prediction as \hat{y} = \hat{f}(\mathbf x)
Generalization Error
Using \hat{f}(\mathbf x), we can assess the training error
How well does the chosen model fit the data that was used to train it?
Expectations are nice to work with, so we’ll define the training error as:
E_D[\mathcal L(\mathbf y_D , \hat{\mathbf y}_D)]
\mathcal L(\cdot) is a proper loss function
This is the expected value over the data we observed in our observed training set
Since this training set is fixed and observed, we can compute this expectation as the empirical average
But, this is not our true loss function to minimize!
We want to understand how well our function fits any data that it might encounter.
This is where we need to go back to \mathcal T.
\mathcal T tells us any and all possibilities that we could see and the probability with which those different outcomes could be generated.
It includes all of the potential feature pairs we could observe.
It also includes all of the noise draws that we could observe conditional on the features.
Generalization Error
A good predictive function should work well for any possible feature/outcome combination we could potentially observe
e.g. be generally good
A reasonable measure of generalization error, then, is:
E_\mathcal T [ \mathcal L(y , \hat{y})]
Weighted for probability of occurrence, what is the average loss we would observe using \hat{f}(\mathbf x) for any possible feature/outcome combination we could observe?
Upweight likely occurrences (a man in retail, no trust fund, brown eyes, 5’8”)
Downweight unlikely occurrences (a man in finance, trust fund, blue eyes, 6’5”)
Generalization Error
We’re going to define i.i.d. generalization error as:
E_\mathcal T [ \mathcal L(y , \hat{y})]
The expected loss with respect to the true data generating distribution.
The problem: We don’t know what \mathcal T is!
If we did, we’d be done here.
All we get are the N samples from \mathcal T that we observe in the training set.
Generalization Error
We don’t get to directly compute E_{\mathcal T} [\mathcal L(y , \hat{y})]
But, we do get to see E_D[\mathcal L(\mathbf y_D , \hat{\mathbf y}_D)]
Can we use the training error to inform us about the generalization error?
Or can we directly compute the generalization error?
Generalization Error
Let’s start with a generic truth: in expectation, the generalization error will always be greater than or equal to the training error
We know that this is true.
Proving this is true is a bit difficult.
But, we can build a proof by making an assumption beyond the i.i.d. assumption
Convenience Assumption - Fixed X
Let’s assume that we have two different i.i.d. data sets pulled from \mathcal T of size N, \{\mathbf X_D , \mathbf y_D\} and \{\mathbf X_D , \mathbf y_D'\}
The training features are the same
But the actual outcomes differ due to different noise draws
The distribution (and moments) over y and y' are the same under this construction. Additionally, the predictions don’t have anything to do with y'. So we can further reduce to:
E_{\mathcal T}[\mathcal L(y , \hat{y})] = E_D[\mathcal L(\mathbf y_D , \hat{\mathbf y}_D)] + \frac{2}{N} \sum_{i=1}^N \text{Cov}(\hat{y}_i , y_i)
As the number of implied features increases, the covariance between the training outcomes and the predictions increases
Put another way, the more implied features in the model, the more the prediction at \mathbf x depends on the value of the outcome at that point and the less it depends on the value of the outcome at surrounding points
For the polynomial model, as the degree increases, there are more terms.
This makes sense because we’re overresponding to wiggliness in the training data
Overfitting
This effect is diminished as N gets larger
Generalization Error
Three factors that affect generalization performance
Training error
The covariance of training outcomes and predictions
The sample size
Lower the training error or the covariance (or increase the sample size) and you decrease the generalization error!
Generalization Error
Generated from a 3rd degree polynomial with \sigma^2 = 9.
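A hedged sketch of that fixed-X experiment (only the degree-3 signal and \sigma^2 = 9 come from the slide; the cubic's coefficients and the sample size are invented): fit polynomials of increasing degree to one noise draw y, then score the same fit against a second draw y' with the same features.

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x):
    """The true signal: a 3rd degree polynomial (coefficients invented here)."""
    return 0.5 * x**3 - x**2 + 2.0 * x

N, sigma = 50, 3.0   # sigma^2 = 9
x = rng.uniform(-3, 3, size=N)
y = f(x) + rng.normal(scale=sigma, size=N)          # training outcomes
y_prime = f(x) + rng.normal(scale=sigma, size=N)    # fresh noise, same fixed X

results = {}
for degree in [1, 3, 9]:
    coefs = np.polyfit(x, y, degree)                # least-squares polynomial fit
    y_hat = np.polyval(coefs, x)
    train_mse = np.mean((y - y_hat) ** 2)
    test_mse = np.mean((y_prime - y_hat) ** 2)      # fixed-X generalization error
    results[degree] = (train_mse, test_mse)

print(results)   # training MSE falls with degree; the gap to test MSE widens
```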
Generalization Error
This is great!
We understand a little about the link between training error and the generalization error
But, this result really only applies when we can compute the covariance of the prediction and the outcome
What is this covariance for an SVM with an RBF kernel?
What is this covariance for a random forest?
To make more headway here, we need to present more general results.
Next Time
The bias/complexity tradeoff!
It’s always there (but goes away as N \rightarrow \infty)
Deep learning is cool and all, but we need a lot of data to justify the flexibility…